Erin M. Buchanan
Last Updated: 2021-01-01
ANLY 500 will focus on the foundations of:
The first question we must ask ourselves in this course is: What is Analytics?
We should note analytics can be defined in two ways!
The utilization of:
The focus of data analytics can be defined under three scopes, including:
Dataset: “Sunspot Trends from 1749-01-01 to 2013-09-01”
Description: Understand the Historical Trend of Sunspots from 1749 to 2013.
## [1] 58.0 62.6 70.0 55.7 85.0 83.5
## Time-Series [1:3177] from 1749 to 2014: 58 62.6 70 55.7 85 83.5 94.8 66.3 75.9 75.5 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.70 42.00 51.96 76.40 253.80
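The printed values above look like output from inspecting the built-in `sunspot.month` time series. The original commands are not shown, so the calls below are an assumption about how similar output could be produced; depending on the R version, the length of the series may differ from the 3177 observations shown.

```r
# Assumed inspection calls; the original commands are not shown in the notes
data("sunspot.month", package = "datasets")

head(sunspot.month)      # first few monthly values
str(sunspot.month)       # structure: a time series object
summary(sunspot.month)   # minimum, quartiles, mean, and maximum
```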
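library(ggplot2)
# Convert the time series to a data frame; the values land in a column named x
sunspot.month <- as.data.frame(sunspot.month)
# Add an observation index to use as the x-axis
sunspot.month$Time <- 1:nrow(sunspot.month)
ggplot(sunspot.month, aes(x = Time, y = x)) +
  geom_point(alpha = 0.5) +
  ylab("Number of Sunspots") +
  xlab("Time") +
  theme_classic()
library(quantmod)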
# Pull roughly the last five years of daily prices, ending two days ago
start <- as.Date(Sys.Date()-(365*5))
end <- as.Date(Sys.Date()-2)
getSymbols("AMZN", src = "yahoo", from = start, to = end)
## [1] "AMZN"
## An 'xts' object on 2016-01-04/2020-12-29 containing:
## Data: num [1:1257, 1:6] 656 647 622 622 620 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:6] "AMZN.Open" "AMZN.High" "AMZN.Low" "AMZN.Close" ...
## Indexed by objects of class: [Date] TZ: UTC
## xts Attributes:
## List of 2
## $ src : chr "yahoo"
## $ updated: POSIXct[1:1], format: "2021-01-01 22:40:50"
# Fit a linear model on the first 1199 trading days (the training rows)
predictive_model <- lm(formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
                       data = AMZN[1:1199,])
summary(predictive_model)
##
## Call:
## lm(formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
## data = AMZN[1:1199, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.653 -5.406 -0.195 5.632 100.519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.971e-01 1.636e+00 0.12 0.904
## AMZN.High 4.799e-01 2.495e-02 19.23 <2e-16 ***
## AMZN.Low 5.210e-01 2.564e-02 20.32 <2e-16 ***
## AMZN.Volume 1.620e-08 2.714e-07 0.06 0.952
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.93 on 1195 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 7.93e+05 on 3 and 1195 DF, p-value: < 2.2e-16
# Arrange the diagnostic plots in a 2 x 3 grid
par(mfrow=c(2,3))
plot(predictive_model,1)
plot(predictive_model,2)
plot(predictive_model,3)
plot(predictive_model,4)
plot(predictive_model,5)
# Number of rows (trading days) in the data
n <- nrow(AMZN)
# Predict the closing price for the held-out rows (1200 through n)
prediction <- stats::predict(predictive_model, AMZN[1200:n,])
tail(data.frame(prediction))
## prediction
## 2020-12-21 3198.146
## 2020-12-22 3203.073
## 2020-12-23 3199.503
## 2020-12-24 3187.688
## 2020-12-28 3238.624
## 2020-12-29 3317.538
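One way to judge predictions like these is to compare them with the actual values on the held-out rows. The sketch below is an assumption-laden illustration, not the original analysis: it uses simulated prices (not real AMZN data) so it runs without a network connection, and the names `prices`, `train_rows`, `test_rows`, and `rmse` are hypothetical.

```r
# Simulated stand-in for daily price data (NOT real AMZN prices)
set.seed(500)
n_days <- 300
low    <- 100 + cumsum(rnorm(n_days))          # random-walk daily low
high   <- low + runif(n_days, 1, 5)            # high is always above low
close  <- low + runif(n_days) * (high - low)   # close falls between low and high
prices <- data.frame(Close = close, High = high, Low = low)

# Split into training rows and held-out rows, mirroring the 1:1199 / 1200:n split
train_rows <- 1:240
test_rows  <- 241:n_days

fit  <- lm(Close ~ High + Low, data = prices[train_rows, ])
pred <- predict(fit, newdata = prices[test_rows, ])

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((prices$Close[test_rows] - pred)^2))
rmse
```

A smaller RMSE means the predictions sit closer to the observed values on data the model did not see during fitting.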
Analytics is the discovery, interpretation, and communication of meaningful patterns in data, or summaries of data, using data analytics.
Now we should be asking the question: What is Data Analytics?
High level analysis techniques commonly used in data analytics include:
However, two other types of analysis may be considered.
Quantitative data analysis: involves analysis of numerical data with quantifiable variables that can be compared or measured statistically.
Qualitative data analysis: involves more interpretive analysis that focuses on understanding the content of non-numerical data such as text, images, audio, and video, including common phrases, themes, and points of view.
In other words, formulate a question that needs to be answered.
Test the concept:
Theory:
Hypothesis:
Falsification:
Independent Variable:
Dependent Variable:
Data is a set of values/measurements of quantitative or qualitative variables.
In a dataset, we can distinguish two types of variables:
Definition - entities that are divided into distinct categories.
Includes the following:
R stores categorical variables as factors or character vectors.
Factors are variables in R that take on a limited number of distinct values.
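As a minimal sketch of the difference between the two storage types, the example below converts a character vector to a factor. The blood type values are hypothetical example data.

```r
# Hypothetical example data: blood types stored as plain text
blood_type <- c("A", "B", "O", "AB", "O", "A")
class(blood_type)            # "character"

# Convert to a factor: R now tracks the distinct categories as levels
blood_factor <- factor(blood_type)
levels(blood_factor)         # the four distinct categories
table(blood_factor)          # counts per category
```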
Definition - A binary variable has only two categories.
Definition - A nominal variable has more than two categories.
Definition - An ordinal variable is the same as a nominal variable, but the categories have a logical order.
In addition to being able to classify values into categories, you can order the categories: first, second, third.
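In R, an ordinal variable can be represented as an ordered factor. The sketch below uses hypothetical race placements to show that ordered categories support comparison and sorting.

```r
# Hypothetical race results; the levels argument sets the logical order
place <- factor(c("second", "first", "third", "first"),
                levels = c("first", "second", "third"),
                ordered = TRUE)

place                 # levels print as first < second < third
place < "third"       # comparisons are allowed for ordered factors
sort(place)           # sorted by level order, not alphabetically
```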
Definition - entities get a distinct score.
Includes the following:
Definition - An interval variable has equal intervals between its values: equal differences on the scale represent equal differences in the property being measured. This variable does not have a true zero.
Definition - A ratio variable is the same as an interval variable, but the ratios of scores on the scale must also make sense. This variable does have a true zero.
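Temperature is a common illustration of the difference: Celsius is an interval scale, while Kelvin has a true zero and is a ratio scale. The sketch below uses two assumed example temperatures to show that differences agree on both scales but ratios do not.

```r
# Temperatures on an interval scale (Celsius has no true zero)
celsius <- c(10, 20)
celsius[2] / celsius[1]        # 2, but 20 C is not "twice as hot" as 10 C

# The same temperatures on a ratio scale (Kelvin has a true zero)
kelvin <- celsius + 273.15
kelvin[2] / kelvin[1]          # about 1.035

# Differences are meaningful on both scales
diff(celsius)                  # 10
diff(kelvin)                   # 10
```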
The accuracy of your measurements is key to your solutions.
Measurement Error (aka observational error):
Definition - The discrepancy between the actual value we’re trying to measure, and the number we use to represent that value.
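A small worked example makes the definition concrete. Both numbers below are hypothetical: an assumed true weight and an assumed scale reading.

```r
# Hypothetical values: the true weight and what the scale reports
true_weight     <- 80   # kg (assumed true value)
measured_weight <- 83   # kg (assumed reading)

# Measurement error is the discrepancy between the two
measurement_error <- measured_weight - true_weight
measurement_error       # 3 kg
```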
Validity:
Including the following:
Reliability:
Test-Retest Reliability:
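One common way to quantify test-retest reliability is the correlation between scores from two administrations of the same measure to the same people. The sketch below uses simulated scores; the sample size, score scale, and error sizes are all assumptions chosen for illustration.

```r
set.seed(42)
# Simulated data: each person's stable underlying score (an assumption)
true_score <- rnorm(100, mean = 50, sd = 10)

# Two administrations of the same measure, each with independent random error
time1 <- true_score + rnorm(100, sd = 3)
time2 <- true_score + rnorm(100, sd = 3)

# Correlation between the two administrations
retest_r <- cor(time1, time2)
retest_r
```

A correlation near 1 suggests the measure produces consistent scores over time; larger measurement error pulls the correlation down.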
To use and test measures in any research, we must next ask: How do we measure?
It is different for certain types of research, including:
Definition - One or more variables are systematically manipulated to see their effect (alone or in combination) on an outcome variable.
Cause and Effect (Hume, 1748)
Confounding variables: the ‘Tertium Quid’
Ruling out confounds (Mill, 1865)
Having considered what to measure and how to measure it, we now look at the methods of data collection.
For instance:
Between-group/between-subject/independent
Repeated-measures (within-subject)
Systematic Variation
Unsystematic Variation
Randomization
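Random assignment to conditions can be sketched in a few lines of R. The participant IDs, group labels, and group sizes below are hypothetical.

```r
set.seed(2021)
# Hypothetical participant IDs
participants <- data.frame(id = 1:20)

# Shuffle an equal number of group labels so assignment is random
participants$group <- sample(rep(c("control", "treatment"), each = 10))

table(participants$group)    # 10 per group
```

Shuffling a fixed set of labels keeps the groups the same size while leaving the assignment of any one participant to chance.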
First, populations and samples should be understood so that your analysis is not misleading when interpreting results.
Population
Sample
A simple statistical model can be used to analyze data.
For instance, the mean is a hypothetical value.
## setosa versicolor virginica
## 5.006 5.936 6.588
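The values above match the mean sepal length for each species in the built-in `iris` dataset. The original command is not shown, so the call below is one assumed way to reproduce them; it also shows that the deviations from each group mean (the error left over by this simple model) sum to approximately zero.

```r
# Mean sepal length per species: the mean acts as a simple model of each group
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means

# Deviations from the model (the error), matched to each flower by species name
errors <- iris$Sepal.Length - group_means[as.character(iris$Species)]

# Within each species the deviations sum to (approximately) zero
round(tapply(errors, iris$Species, sum), 10)
```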
The numbers estimated from a single test/study/experiment are considered a sample.
Parameters = Greek Symbols
Statistics = Latin Letters
## setosa versicolor virginica
## 5.157143 5.700000 6.560000
## setosa versicolor virginica
## 5.006 5.936 6.588
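The two sets of means above suggest a comparison between a subset of flowers and the full dataset. The sampling procedure is not shown, so the sketch below assumes a simple random sample of 20 rows from `iris`; its values will not match the first set printed above.

```r
set.seed(500)
# Draw a random sample of 20 flowers (the sample size is an assumption)
sample_rows <- sample(nrow(iris), size = 20)
iris_sample <- iris[sample_rows, ]

# Sample statistics: estimates based on the sampled flowers only
tapply(iris_sample$Sepal.Length, iris_sample$Species, mean)

# The same summary on all 150 flowers, treated here as the population
tapply(iris$Sepal.Length, iris$Species, mean)
```

The sample means land near, but not exactly on, the values computed from all rows, which is why a statistic is treated as an estimate of a parameter.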
To analyze the data and generate interpretable results the following statistical models can be used:
In this lecture, you have learned: